Skip to content

Comments

Fix nil pointer panic when Flight SQL worker crashes during active session#223

Merged
fuziontech merged 3 commits intomainfrom
fix/flight-executor-nil-panic
Feb 17, 2026
Merged

Fix nil pointer panic when Flight SQL worker crashes during active session#223
fuziontech merged 3 commits intomainfrom
fix/flight-executor-nil-panic

Conversation

@fuziontech
Copy link
Member

Summary

  • Fixes a nil pointer dereference panic that crashes the entire control plane when a Flight SQL worker dies while a session goroutine is executing a query
  • Root cause: arrow-go's flight.(*client).Close() sets FlightServiceClient = nil on the internal struct. Since sessions share the worker's gRPC client (NewFlightExecutorFromClient), the health check goroutine closing the dead worker's client races with session goroutines still making RPCs on it
  • Adds an atomic dead flag to FlightExecutor that's checked before any RPC, plus recover() as belt-and-suspenders for the race window between the flag check and the gRPC call
  • OnWorkerCrash now marks all affected executors dead before the health check closes w.client

Test plan

  • Verify all existing tests pass (go test ./...TestDefaultMaxWorkers failure is pre-existing/CPU-count dependent)
  • Deploy to staging and confirm worker crashes no longer bring down the control plane process
  • Verify affected sessions see an error message instead of the process panicking

🤖 Generated with Claude Code

fuziontech and others added 3 commits February 17, 2026 13:41
…ssion

arrow-go's flight.Client.Close() nils out the embedded FlightServiceClient.
When a worker crashes, the health check goroutine closes the shared gRPC
client while session goroutines are still using it, causing a nil pointer
dereference in GetFlightInfo that crashes the entire control plane.

Fix with two layers of defense:
- Add atomic `dead` flag to FlightExecutor, checked before any RPC call
- Add recover() in QueryContext/ExecContext to catch the panic in the race
  window between the flag check and the actual RPC
- Mark all affected executors dead in OnWorkerCrash before the client is closed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…add tests

1. Narrow recoverClientPanic to only catch nil pointer dereferences;
   re-panic on unrelated errors to preserve stack traces.

2. Add recoverWorkerPanic to ManagedWorker.CreateSession and DestroySession
   which access the shared gRPC client directly (same crash class).

3. Proactively close TCP connections on worker crash via connCloser so
   session goroutines exit cleanly instead of looping on ErrWorkerDead.

4. Add 16 unit tests covering MarkDead, recover behavior (nil pointer
   vs other panics), OnWorkerCrash session/connection cleanup, and
   SetConnCloser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add recoverWorkerPanic to the doHealthCheck call in HealthCheckLoop.
  Same race: w.client.Close() from a concurrent crash/retire nils out
  FlightServiceClient while the health check goroutine calls DoAction.

- Replace nil contexts with context.Background() in tests.

- Add TestDestroySessionAfterOnWorkerCrash to verify the exact
  production sequence: OnWorkerCrash cleans up, then the deferred
  DestroySession is a safe no-op.

- Add comment explaining intentional double-close of TCP connection
  (OnWorkerCrash + handleConnection defer).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fuziontech fuziontech merged commit 62340a6 into main Feb 17, 2026
11 checks passed
@fuziontech fuziontech deleted the fix/flight-executor-nil-panic branch February 17, 2026 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant